Efficient cross-validation traversals in feature subset selection

Sparse and robust classification models have the potential for revealing common predictive patterns that not only allow for categorizing objects into classes but also for generating mechanistic hypotheses. Identifying a small and informative subset of features is their main ingredient. However, the exponential search space of feature subsets and the heuristic nature of selection algorithms limit the coverage of these analyses, even for low-dimensional datasets. We present methods for reducing the computational complexity of feature selection criteria allowing for higher efficiency and coverage of screenings. We achieve this by reducing the preparation costs of high-dimensional subsets \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${\mathscr {O}}({n}m^2)$$\end{document}O(nm2) to those of one-dimensional ones \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$${\mathscr {O}}(m^2)$$\end{document}O(m2). Our methods are based on a tight interaction between a parallelizable cross-validation traversal strategy and distance-based classification algorithms and can be used with any product distance or kernel. We evaluate the traversal strategy exemplarily in exhaustive feature subset selection experiments (perfect coverage). Its runtime, fitness landscape, and predictive performance are analyzed on publicly available datasets. Even in low-dimensional settings, we achieve approximately a 15-fold increase in exhaustively generating distance matrices for feature combinations bringing a new level of evaluations into reach.


Methods
We analyze FSS in classification 41 . A classifier is a function c : X → Y for predicting the class label y ∈ Y of a given object represented by a vector of features or measurements (x (1) , . . . , x (n) ) T = x ∈ X . For simplicity we assume the feature space to be embedded in a n-dimensional real-valued space X ⊆ R n . However, the findings presented in this manuscript also hold for heterogene data representations, where each x (i) has a different scale (e.g. qualitative or quantitative) or complex data type (e.g. tensors, graphs, strings, . . . ).
A classifier is initially adapted via a set of labeled training examples T = {(x j , y j )} m j=1 and subsequently tested on an independent set of labeled validation examples V , T ∩ V = ∅ . We will use the notion c T if the classifier c is trained on the training set T . The set of all available samples will be denoted by S = T ∪ V.
FSS is an optional subprocess within the training of a classifier. Its task is the selection of a suitable feature combination from the set of 2 n − 1 possible ones. Deselected features can no longer influence the prediction of the classifier. The underlying search space of a FSS problem is illustrated Hasse diagram as shown in Fig. 1. A selected feature set of size n will be described by a sorted and repetition free index vector i = (i 1 , . . . , in) T , The corresponding reduced representation of a sample x ∈ R n will be given by x (i) = x (i 1 ) , . . . , x (in) T . The notation î = {i j }n j=1 will be used to denote the unordered set of indices.
The generalization ability of a classifier can be evaluated in a r × f cross-validation (CV) experiment 41 , where r denotes the number of runs and f the number of folds. This subsampling experiment is a summary of r independent runs of a f-fold CV experiment. A single f-fold CV divides the overall set of samples S into f folds of approximately equal size. In an iterative procedure, each of the folds is used as a validation set while the remaining folds are used for training the classifier. The mean error rate over all folds and runs is afterwards used as an estimate of the classifier's generalization error (1) i ∈ I = {i ∈ Nn ≤n | i k−1 < i k , 1 ≤ i k ≤ n}.
A special case is the 1 × |S | CV which is also called leave-one-out cross-validation (LO). Here, each sample is evaluated separately. The symbol R lo is used to denote the corresponding error estimate.
The exhaustive k-NN approach (e-k-NN). In the following, we propose two techniques for improving the runtime complexity of distance-based feature selection criteria. The first one is an efficient memoization and enumeration scheme for the calculation of distance matrices ("Distance based FSS" and "Exhaustive enumeration scheme" sections). The corresponding sequence of feature sets can be split into an arbitrary number of junks allowing an easy parallelization with individual workload for each compute note ("Parallelization" section). The second one is a fast evaluation technique for CV experiments based on k-NNs ("Nearest neighbor classification" and "Estimation of k-nearest neighbor cross-validation complexity" sections). It implies a new error bound for the CV error of k-NNs ("Cross-validation error bound" section).
We demonstrate the performance of these techniques in exhaustive FSS experiments. That is, the chosen selection criterion is evaluated for each of the 2 n − 1 feature subsets. In this case, the selection strategy is free from subsampling effects as induced by stochastic or heuristic strategies. It is the only selection strategy that is guaranteed to find a global optimum solution for any type of objective 42,43 . As a selection criterion, the minimal r × f CV error of a k-NN was chosen Note that the evaluation of i is performed on the training set of a classifier, where T i = {(x (i) , y) | (x, y) ∈ T } denotes the training set's restriction to the measurements in i . The selection process is therefore based on an internal CV. We finally evaluated this overall strategy as exhaustively feature selecting k-NN classifiers (e-k-NN) in ("Results" section. Distance based FSS. The optimization criterion given in Eq. (3) can be seen as a distance based FSS criterion. That is, the assessment of a feature subset is based on pairwise distances d(., .) measured on the training samples, where These measurements are summarized in a distance matrix D: Index i will be used to indicate the selected feature subset. In the following, we assume that this distance measure is decomposable. That is, it can be seen as sum of the feature-wise (one dimensional) distances: www.nature.com/scientificreports/ Distance measures that fulfill this criterion are for instance the (potentiated) Minkowski metrices or product distances of type d(., .) = i∈î d i (., .) , which evaluate each feature by a separate distance 44 or kernel 45 . We will utilize a variant of the Euclidean distance (no square root) as a canonical example A standard implementation for calculating this distance matrix in an |î| dimensional space can be done in time O (m 2 |î|).
As a distance-based FSS algorithm has to calculate a distance matrix for each feature subset evaluation a reduction of this computational complexity is desirable and extends the coverage of the search space of feature subsets I . For a decomposable distance the complexity can be improved by memorizing distance matrices. For example, if the distance matrices D i ′ and D i ′′ are known (e.g. from previous evaluations), where î ′ ∩î ′′ = ∅ and î ′ ∪î ′′ =î , the distance matrix D i can be calculated via a single matrix addition D i ′ + D i ′′ . This reduces the computational complexity to O (m 2 ).
Especially FFS algorithms that modify existing distance matrices by adding or removing individual features can benefit from this effect by memorizing all n feature-wise distance matrices D (i) . In this case, each modification can be seen as a single matrix addition (or subtraction). This holds for all search algorithms that directly operate on the structure of the Hasse diagram. The memorization requires once an additional time and space complexity of O (nm 2 ) . For a large number of evaluations e ≫ n we get a time complexity of against the naive time complexity of O (enm 2 ).
Exhaustive enumeration scheme. Remarkably, this reduction can be achieved by exhaustive search algorithms, if the feature signatures are processed in lexicographical order 46 . The computational complexity is than reduced to O (2 n m 2 ). Assuming that all feature-wise distance matrices D (i) are already memorized for all n one-dimensional subspaces i ∈ {1, . . . , n} , the distance matrices for any multi dimensional index vector i = (i 1 , . . . , i k ) T , can be calculated by a single matrix addition: Here i ′ denotes the parent of i . Structuring the set of all possible feature combinations according to the parental relationship leads to the construction of a search tree ( Fig. 2A). Here it can be observed that at most n additional parental nodes have to be memorized (Fig. 2B). Overall at most 2n matrices are required, which corresponds to a space complexity of O (nm 2 ).
Parallelization. In order to facilitate parallelization, the procedure described above can be modified to start at an arbitrary index vector i (Fig. 2C). Memorizing the distance matrices for the sequence of parental index vectors of i and the feature-wise matrices allows calculating the remaining distance matrices D j , i ⊏ j as mentioned before. In this way, the complete sequence of all feature signatures can be split into partial sequences of arbitrary length and can afterwards be processed in parallel. The length of the partial sequences might be chosen in order to optimize the load balancing of the available compute cores 47 . For each parallel call, a copy of the feature-wise and cumulative distance matrices are required.
The starting points for the partial sequences can be addressed directly without enumerating predecessor signatures. An example is given in Table 1. A feature signature i can be identified and reconstructed from its position id in the lexicographical order and the total number of features n. The corresponding mapping will be called f sig (id, n) = f sig (id, n, 1) in the following where (j 1 , . . . , j l ) = f sig (id − 1, i + 1, n) . The recursive formula f sig (id, i, n) will stepwise elongate the partial signature until all required feature indices are incorporated. In this context variable i can be seen as the index of the ith feature. It is initialized with i = 1 . Starting with the first element, f sig will traverse stage-wise through the search tree of the enumeration scheme. The current position of the feature signature is filled with i, if id ∈ 2, 2 n−i . If   www.nature.com/scientificreports/ id > 2 n−i , the search is continued with feature index i := i + 1 and a diminished id := id − 2 n−i . If the current position is filled, a recursive call of f sig fills the next position of the signature. This screen takes place on the next stage of the search tree. As we operate on sorted and repetition free index vectors, the next possible feature index is i := i + 1.
The position id of a signature i can also be calculated directly, if the total number of features n is known. The corresponding function will be denoted as Here we assume that |∅| = 0 , i 0 = 0 and i, p ∈ N.
Nearest neighbor classification. In the following, we concentrate on Nearest Neighbor Classifiers (k-NN) 40 as a conceptually simple example for a distance-based classifier. The k-NNs are prototype-based classifiers that utilize all available training samples as prototypes ( P = T ). They predict the class of a new unseen query point v ∈ R n by a majority vote on the class labels of the k nearest neighbors in P and rk D v the rank function over pairwise distances d(., .) between v and the members of P The k-NN is therefore parameterized by the chosen neighborhood size k and the chosen distance measure d. The computational complexity of applying k-NN corresponds to the complexity of finding k times the minimum distance between the training samples and the query sample O (k|î||T |) . The factor |î| can again be removed, if the k-NN is embedded in an exhaustive FSS experiment.
Estimation of k-nearest neighbor cross-validation complexity. The computational costs for a r × f CV evaluation of the k-NN can be reduced by memorizing different aspects of the global data set S of all available samples. The number of distance calculations  www.nature.com/scientificreports/ can be reduced to |S | 2 by precalculating the global distance matrix. Here, T st and V st denote the training and test sets of a single experiment. We assume that |S | is dividable by f. The number of distance comparisons can be reduced via a related strategy (Fig. 3). For a given test sample v ∈ V st , it is likely that there is an overlap between its k nearest neighbors in the current training set T st and the k nearest neighbors among the samples in S . The second one will be called global nearest neighbors in the following. If one of the global nearest neighbors is included in the current training set the corresponding search ( |T st | distance comparisons) can be replaced by a single lookup. If the current k nearest neighbors correspond to the k global nearest neighbors the prediction of v is equal for the current training/test split and LO. The occurrence of a successful lookup lu is distributed according to a hypergeometric distribution lu ∼ HG(k, |S | − 1, |T |) . In expectation nearest neighbors can be found for each test sample.
For the |V | validation samples in each of the r × f in each folds, the following number unsuccessful lookups is expected This reduces the number of required distance comparisons to at least In contrast to the distance calculations, the lookup table for the distance comparisons must be calculated explicitly. The costs for this lookup table are approximately given by k|S | 2 distance comparisons. The overall time complexity of a single k-nearest neighbor experiment summarizes then to The space complexity of the additional lookup structure is O k|S | .

Cross-validation error bound.
Runtime optimisation via lookups can also be used to compute lower and upper bounds on the cross-validation error R cv of k-NN. This bound depends on the classifier's leave-one-out error and the number of folds f.
Let R cv =R cv (S ) and R lo =R lo (S ) denote the risk estimate of a f-fold cross-validation and a leave-one-out cross-validation respectively. Given a random split of an f-fold cross-validation, the global k nearest neighbors of any data point are also included in the current training set T with probability   Table 2.
With probability p lo , CV and LO lead to the same classification of a data point (cf. Eq. 18). For those cases in which the global k nearest neighbor is not included in the training set, let R u denote the (unknown) error rate of the resulting predictions. Now the expected CV error can be written as where E F denotes the expectation value under the uniform distribution F of all possible splits into f folds. Since

Experiments
Our experiments focus on three different aspects of the exhaustive FSS. Its runtime ("Runtime experiments" section), fitness landscape ("Exhaustive screening experiments" section) and generalization ability ("Cross-validation experiments" section) are evaluated. In all experiments, the e-k-NN is validated for k ∈ {1, 3, 5, 7} . Real-world datasets were downloaded from the UCI machine learning repository 48 (Table 3)   www.nature.com/scientificreports/ on artificial datasets with m ∈ {50, 100} samples and n ∈ {30, . . . , 40} features. Each sample x was drawn i.i.d from x ∼ U (0, 1) n . A maximal time limit of 14 days ( ≈ 1.21 · 10 9 ms) was used. The exhaustive enumeration scheme is compared to a de novo calculation of all distance matrices. For the runtime evaluation of the e-k-NN a 10 × 10 CV is used as an internal validation method. It is performed with and without the use of the lookup strategy ("Estimation of k-nearest neighbor cross-validation complexity" section). All experiments are based on multi-core evaluations on p = 100 cores.
Exhaustive screening experiments. The exhaustive FSS presented above provides a census on the CV accuracies achieved by k-NN over all feature combinations. The corresponding fitness landscape can be evaluated and visualized in order to provide information on the influence of individual features and the potential success of FSS.
In an explorative analysis, we provide the (exact) distributions of accuracies for several real datasets (Table 3) in order to identify patterns that may hint at a success or failure FSS. The accuracies are organized in histograms according to their underlying signature size n.

Cross-validation experiments.
The e-k-NN can be applied as a standalone classification algorithm, which itself can be evaluated in (outer) 10 × 10 CV experiments. We utilized 8 UCI datasets (Table 3)  As the e-k-NN is trained 10 × 10 on different training sets, its internal exhaustive FSS will return up to 100 feature signatures in these experiments. The mean number of features and their standard deviation will be reported.

Results
The following section provides the results of the three experimental setups.  Figure 4 provides the results of the exhaustive distance calculations. For m = 50 samples the (standard) de novo calculation achieved the enumeration for n = 40 in 9d, 21h, 9min. The proposed traversal strategy required 15h 8min. For m = 100 samples the de novo calculation was only able to handle a dimensionality of n = 38 in the given time limit (9d, 17h, 35min). The traversal strategy was able to handle n = 40 in 2d, 20h, 58min.

Runtime.
The evaluation of the e-k-NN strategy is given in Fig. 5. In this experiment, the enumeration of the distance matrices is coupled to the workload of a 10 × 10 CV. The e-k-NN was applied with and without the use of its   www.nature.com/scientificreports/ Fitness landscape. In our explorative analysis, we recorded the (internal) CV accuracy of each feature combination. A summary for k = 1 is given in Fig. 6. An overview of the remaining experiments ( k ∈ {3, 5, 7} ) is given in the supplementary information. All figures organize the accuracies according to the feature set sizes (x-axis), where each column provides a histogram of the feature combinations of size n. For each dataset, the leftmost histogram shows the accuracies achieved by the n individual features ( n = 1 ). It characterizes the performance of the features as univariate standalone markers. Over all datasets, standalone markers achieved accuracies in the range of [14.2%; 87.2%] . For the individual datasets, the largest range of [14.4%; 64.5%] was observed for d 6 (k = 7) . The smallest one of [31.8%; 37.6%] was found for d 1 (k = 1) . For all experiments the multivariate feature combinations ( n > 1 ) achieve better maximal accuracies than the univariate ones. In general, the quantity increases already for small feature set sizes.
Higher ranges of accuracies are observed for multivariate feature combinations. While the ranges initially increase for smaller n they decrease for n → n . This might be caused by the underlying number of combinations n n , which is maximal for n = n/2 . Another reason might be the overlap of the corresponding feature signatures. While feature signatures can be constructed from distinct features for smaller n , they inevitably overlap for larger n resulting in similar distance information.
The rightmost histogram (single bar) provides the accuracy gained by the full set of features ( n = n ). For each experiment, feature subsets exist that outperform the full set. This can be observed for 10.0% to 100.0% of the feature set sizes (mean: 84.0% ). Feature subsets leading to the highest accuracies utilize 14.3% to 91.7% of all features with a mean value of 48.2% . The number of feature combinations with better or equal accuracies than the full set of features might be seen as an indicator for a successful feature selection in terms of improved accuracy or reduced feature set sizes. In our experiments, these numbers range from 95.4% ( d 1 ) to 0.2% ( d 8 ). It is interesting to see that especially the multivariate feature combinations tend to form multimodal distributions (according to accuracy) indicating the existence of multiple quality classes of feature combinations. These classes are mainly determined by the presence or absence of specific feature combinations, which dominate the influences of the corresponding signature. For larger feature set sizes the different subpopulations are likely to be absorbed by the overall central tendency.
Cross-validation results. The results of the CV experiments are summarized in Table 4. Here, the classification accuracies in external 10 × 10 CV experiments and the number of necessary features for e-k-NN leading to maximal accuracies are depicted.
In total, 7 out of 8 datasets obtained higher accuracies using e-k-NN instead of k-NN. Here, in 7 of 8 cases e-1-NN outperformed 1-NN, in 6 of 8 experiments e-3-NN led to higher accuracies than 3-NN, for 6 of 8 datasets e-5-NN outperformed 5-NN and in 7 of 8 cases e-7-NN achieved a better performance than 7-NN. The predictive performance of k-NN can be improved by up to 28.8% using e-k-NN. On average, an improvement of up to 11.2% (e-5-NN) in accuracies can be observed.
The variants of e-k-NN selected [14.4%; 88.3%] (average: 42.8% ) of all features. The lowest feature numbers were needed by d 3 while outperforming the traditional k-NN by at least 15.2% . d 8 required the largest feature set, leading to a decline in accuracy by at least 0.4% compared to k-NN.
For 5 out of 8 datasets, the overall highest accuracies were achieved by a variant of e-k-NN. The standard k-NN performed best for 1 out of 8 datasets. In 1 out of 8 cases the L2-SVM and C4.5 outperformed the other classifiers, respectively. Over all 8 datasets, the L1-SVM, CART, LVQ and NCC never achieved the best performance.

Discussion and conclusion
The design of a suitable feature subset is one of the major tasks in training a classification model with a high impact on its final predictive performance. Adapted to the same task a classification model might be highly predictive or non-predictive if based on different feature representations of a data collection. This effect is not only of interest for modelling purposes but also in a broader scientific context, where informative features might reveal new hypotheses. The second one extends the traditional scope of feature subset selection from higherdimensional settings to lower-dimensional ones.
In both scenarios, high coverage of all feature combinations would be desirable for detecting optimal solutions, characterizing their neighborhoods, identifying potential alternatives and finally for outlining the landscape of the underlying search space. However, this aim becomes more and more unrealistic for higher dimensions due to the exponential nature of the exponential growth of the number of feature combinations. Here, additional evaluations can enlarge the exploration of the search space. Those aims can be supported by fast evaluation criteria as they allow for higher coverage in the same amount of time. Nevertheless, these evaluation criteria should also be robust to counteract the effects of overfitting. Time-consuming re-or subsampling experiments might be required to ensure this property. Complexity reduction should therefore focus on this aspect.
In this work, we address this scenario by proposing two techniques for efficient cross-validation traversals that improve the runtime of distance-based feature subset selection and even extend the range of exhaustive feature subset evaluations. Both are accompanied by theoretical findings on their computational complexity and error bounds. The first one, a traversal strategy, reduces the computational complexity of multivariate distance calculations to the complexity of univariate ones. Dimensionality therefore no longer affects the generation of distance  www.nature.com/scientificreports/ matrices, if feature selection experiments are performed incrementally as for example in forward-selection or backward-elimination wrappers. We show the potential of these traversals in exhaustive feature selection experiments, which comprise an exponential number of distance evaluations. For these enumerations, we provide an optimal walkthrough that not only allows the use of univariate distance calculations for each feature signature but also the use of arbitrary splits for load-balanced parallelization. In our experiments, the dimensionalityinvariant techniques allowed for the evaluation of 2 40 − 1 marker combinations. This number is independent of the dimensionality of the feature subsets. It is clearly out of range for de novo calculations, which become even more complex for higher dimensions. The second approach addresses the computational complexity of cross-validation experiments of distancebased classifiers. This type of experiment is mainly designed for simulations of independent training and test runs and therefore assumes a de novo calculation of classification models and predictions. Nevertheless, memoization techniques might be used to improve training and evaluation time. For the k-Nearest Neighbor classifier, memoization can be applied twice. First, the repetitive calculation of distances between training and test samples can be replaced by providing a global distance matrix as lookup structure. Second, the global k nearest neighbors extracted from this matrix are likely to correspond to the local k nearest neighbors of an individual experiment, which allows for a probabilistic retrieval.
The memoization of k nearest neighbors reveals an inherent relation of the leave-one-out (global structure) and other cross-validations (local structures), which can not only be applied for runtime improvement but also for error estimation. We provide a theoretical error bound on the general cross-validation error based on the error of the leave-one-out. It might be used for replacing standard cross-validation by leave-one-out experiments if upper bounds are sufficient.
Our experiments did not only allow the comparison of runtimes but also the exploration of exhaustive feature selection experiments. Although typically omitted due to their exponential computational complexity exhaustive feature selection experiments have some interesting theoretical properties. In contrast to (standard) heuristic or stochastic feature selection strategies, exhaustive feature selection -per definition-does not suffer from subsampling effects. Global optimal solutions can not be missed. It therefore allows to provide a direct relation between the chosen optimization criterion and the analyzed data. Although other optimization techniques also allow to provide (global) optimal solutions for specific criteria, the exhaustive one is the only general purpose optimization technique with this property. It is therefore of interest for a wider range of applications.
Exhaustive feature selection achieves this property by conducting a census on all feature combinations. The corresponding data can afterwards be used for providing a fitness landscape, which outlines feature interactions and their influence on a classifier's performance. For the k Nearest Neighbor classifier, we observed a beneficial influence of predictive standalone markers. They are a major ingredient of the top-scoring features combinations. The more features a signature reassembles the higher gets its similarity to the full signature. This not only affects the similarity of the corresponding distance matrices but also the similarity of the induced accuracies. As the k Nearest Neighbor classifier is unable to conduct (internal) feature-wise decisions its performance is influenced by Table 4. Results of the 10 × 10 CV experiments. The table reports the achieved accuracies and in the number of selected features (mean ± stdev. in %). k-NN denotes conventional CV simulation, e-k-NN refers to exhaustive FSS. The highest mean accuracy per dataset is printed in bold. www.nature.com/scientificreports/ all components of the underlying feature signature. Although we observe synergistic effects especially for smaller feature signatures, larger ones are likely to be overshadowed by uninformative components leading to a decline of accuracy. Individual features can only dominate the performance of the overall classifier if they outperformed the remaining features in scale. This effect might be different for other types of classifiers. Therefore, our future work will focus on the generation of fitness landscapes for other classification models, e.g., the L2-SVM and NCC. While our scheme can be transferred to other distance-based classifiers, exhaustive screens might not as easily be conducted for other types of models. Other acceleration techniques will be required for these model classes.

Data availability
All data generated or analysed during this study are included in this published article and its supplementary information file. Artificial data was generated by random sampling from a uniform distribution (see "Runtime experiments" section). The real-world data that was analysed is publicly available from the UCI machine learning repository https:// archi ve. ics. uci. edu.